Gene expression is determined by genomic elements called enhancers, which contain short motifs bound by different 
transcription factors (TFs). However, how enhancer sequences and TF motifs relate to enhancer activity is unknown, and 
general sequence requirements for enliancers or comprehensive sets of important enhancer sequence elements have 
remained elusive. Here, we computationally dissect thousands of functional enhancer sequences from three different 
Drosophila cell lines. We find that the enhancers display distinct cis-regulatory sequence signatures, which are predictive of 
the enhancers' cell type-specific or broad activities. These signatures contain transcription factor motifs and a novel class of 
enhancer sequence elements, dinucleotide repeat motifs (DRMs). DRMs are highly enriched in enhancers, particularly in 
enhancers that are broadly active across different cell types. We experimentally validate the importance of the identified 
TF motifs and DRMs for enhancer function and show that they can be sufficient to create an active enhancer de novo from 
a nonfunctional sequence. The function of DRMs as a novel class of general enhancer features that are also enriched 
in human regulatory regions might explain their implication in several diseases and provides important insights into 
gene regulation. 



[Supplemental material is available for this article.] 

Enhancers (Banerji et al. 1981) or cis-regulatory modules (CRMs) 
are genomic elements that regulate gene expression, thereby con- 
trolling development and physiology (Levine 2010). Enhancers 
function independently of their endogenous contexts, e.g., when 
placed upstream of a reporter gene (Banerji et al. 1981; Doyle et al. 
1989; Visel et al. 2009; Kvon et al. 2012), arguing that the information 
required for their activity resides within their DNA sequences 
(Yáñez-Cuna et al. 2012). However, how enhancer sequences 
relate to enhancer activity is unknown and has remained one of the 
most important and attractive open questions in today's biology. 

Enhancer sequences contain short DNA sequence motifs that 
serve as binding sites for transcription factors (TFs), and the com- 
bined regulatory cues of all bound TFs determine an enhancer's 
activity (Small et al. 1992; Spitz and Furlong 2012). However, 
which TF motifs or combinations of motifs are required has 
remained elusive, and predictions of enhancer activity from the 
enhancer sequence or its motif content still remain challenging 
(Berman et al. 2004; Yáñez-Cuna et al. 2012). In addition, enhancers 
with similar functions can have different motif content or 
TF binding patterns, questioning even the existence of general 
rules or a "regulatory code" (Brown et al. 2007; Zinzen et al. 2009). 

Combinations of motifs that are sufficient for enhancer function 
are unknown, suggesting that even our understanding of the 
types or identities of important sequence elements might be incomplete. 
Indeed, a recent survey of putative regulatory 
regions in the human genome has led to the discovery of many 
previously unknown sequence motifs (Thurman et al. 2012). A 
comprehensive understanding of enhancer sequence elements is an 
important goal, as it would allow the functional interpretation of 
noncoding sequence mutations and their impact on gene expression 
and disease. Such mutations have recently been shown to be 
relevant for genetic diseases such as polydactyly (e.g., Shh limb 
enhancer; Sagai et al. 2005) and cancer (Sur et al. 2012; Huang et al. 
2013). It is also important as a complete set of sequence elements 
can allow the prediction of novel enhancers by searching for regions 
in which such elements are enriched, occur in certain arrangements, 
or are evolutionarily conserved (Berman et al. 2002; Markstein et al. 
2004; Hallikas et al. 2006; Warner et al. 2008; Aerts et al. 2010). 

Most of our current understanding about enhancer sequences 
has come from systematic mutational analyses of individual enhancers 
such as even-skipped stripe 2 (Small et al. 1992), sparkling 
(Swanson et al. 2010), or the interferon-β (IFN-β) enhanceosome 
(Thanos and Maniatis 1995), and such tests have recently been 
scaled up substantially by the use of transcriptional reporter systems 
and sequence barcodes (Kwasnieski et al. 2012; Melnikov 
et al. 2012; Patwardhan et al. 2012; Kheradpour et al. 2013). A 
promising alternative is the statistical sequence analysis of large 
sets of independent sequences with identical or similar functions 
(Roth et al. 1998; Yáñez-Cuna et al. 2013). This approach is based 
on the assumption that shared functions stem from shared se- 
quence features, which can be identified by means of their statis- 
tical overrepresentation (for reviews, see Stormo 2000; Hardison 
and Taylor 2012). It has been applied frequently and successfully, 
e.g., to proximal promoters (Roth et al. 1998), TF-binding sites 
(Heinz et al. 2010; Lin et al. 2010; Yáñez-Cuna et al. 2012) and TF- 
bound enhancers (Smith et al. 2013; White et al. 2013), sets of co- 
regulated genes (Philippakis et al. 2006; Elemento et al. 2007; Aerts 
et al. 2010), and to putative enhancers predicted based on co-factor 
binding, histone marks, or DNA accessibility (Narlikar et al. 2010; 
Lee et al. 2011; Burzynski et al. 2012). The application of this ap- 
proach to large sets of enhancers that are active in different cell 
types has the potential to reveal novel enhancer features, but has 
not yet been possible because only a small number of active enhancers 
have been identified and functionally characterized in any 
given cell type or species. 

Here, we computationally dissect the sequences of thousands 
of active enhancers from three Drosophila cell lines. We demonstrate 
that cell type-specific enhancer function can be predicted 
with high accuracy from enhancer sequences and identify 
a novel class of general enhancer features, dinucleotide repeat 
motifs (DRMs). DRMs are required for enhancer activity in cell 
type-specific enhancers, and high numbers are characteristic of 
enhancers with broad activity. We draw a general model for enhancer 
activity that incorporates TF and repeat motifs and present 
a set of motifs that is sufficient for enhancer activity. 

Results 

Thousands of enhancers with cell type-specific activity 

We selected three established Drosophila melanogaster cell lines: 
hematopoietic S2 cells derived from late embryos (Schneider 
1972), neuronal BG3 cells from larvae (Ui et al. 1994), and 
ovarian somatic cells (OSCs) from adult ovaries (Saito et al. 
2009). This selection comprises cell types derived from different 
tissues and different stages during the Drosophila life cycle 
and exhibiting distinct gene expression profiles (Cherbas et al. 
2011). We previously identified active enhancers in S2 cells and 
OSCs using STARR-seq, a genome-wide activity-based enhancer 
screening method (Arnold et al. 2013) and now performed 
STARR-seq screens in BG3 cells (Supplemental Fig. S1), revealing 
a total of 14,280 active enhancers, of which thousands were 
detected in only one of the three cell types and 814 in all three 
cell types (Fig. 1A,B). 

To determine the sequence features required for cell type- 
specific enhancer activity, we defined four stringent groups of 500 
enhancers each: the 500 strongest enhancers of each cell type that 
were not active (STARR-seq P > 0.1) in any of the other two cell 
types (cell type-specific enhancers) and 500 enhancers that were 
strongly active in each of the three cell types (STARR-seq enrichment 
≥2-fold, P < 0.001; broadly active enhancers). In addition, 
we defined a control set of 1000 sequences that had an identical 
genomic distribution to that of the active enhancers (i.e., mainly 
intronic and intergenic) but were inactive in all three cell types 
according to STARR-seq (Arnold et al. 2013). 
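
The group definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Peak` record, its field names, and the helper functions are hypothetical and are not taken from the original analysis pipeline; only the thresholds (enrichment ≥2-fold, P < 0.001 for active; P > 0.1 for inactive; 500 enhancers per group) come from the text.

```python
from dataclasses import dataclass

CELL_TYPES = ("S2", "BG3", "OSC")


@dataclass
class Peak:
    """Hypothetical record: one candidate region with per-cell-type statistics."""
    name: str
    enrichment: dict  # cell type -> STARR-seq fold enrichment over input
    p_value: dict     # cell type -> enrichment P-value


def is_active(peak, cell, min_fold=2.0, max_p=0.001):
    # "Strongly active": enrichment >= 2-fold with P < 0.001.
    return peak.enrichment[cell] >= min_fold and peak.p_value[cell] < max_p


def is_inactive(peak, cell, min_p=0.1):
    # "Not active": STARR-seq P > 0.1.
    return peak.p_value[cell] > min_p


def classify(peak):
    """Return 'broad', '<cell>-specific', or None for ambiguous regions."""
    active = [c for c in CELL_TYPES if is_active(peak, c)]
    if len(active) == len(CELL_TYPES):
        return "broad"
    if len(active) == 1:
        others = [c for c in CELL_TYPES if c != active[0]]
        if all(is_inactive(peak, c) for c in others):
            return active[0] + "-specific"
    return None  # e.g., active in two cell types, or borderline P-values


def strongest_specific(peaks, cell, n=500):
    """The n strongest enhancers that are specific to the given cell type."""
    chosen = [p for p in peaks if classify(p) == cell + "-specific"]
    chosen.sort(key=lambda p: p.enrichment[cell], reverse=True)
    return chosen[:n]
```

Regions that fall between the two P-value thresholds are deliberately left unclassified, which mirrors the stringency of the four groups described above.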

As expected for enhancers that are active in their endogenous 
cellular context (Arnold et al. 2013), the cell type-specific activity 
of the enhancers in each of the four groups was reflected by the 
expression levels of the neighboring genes as measured by RNA-seq 
(Fig. 1C). For example, genes next to S2-specific enhancers were 
specifically expressed in S2 cells but not in BG3 cells or OSCs, and 
the equivalent was true for BG3- or OSC-specific enhancers. Genes 
next to broadly active enhancers were expressed at similar levels in 
all three cell types and enriched in gene ontology categories related 
to housekeeping functions such as cytokinesis, cell division, and 
metabolic processes (Supplemental Fig. S2). Enhancers from all cell 
types showed a similar overall genomic distribution (Supplemental 
Fig. S1; Arnold et al. 2013), with a slight enrichment of broadly 
active enhancers near transcription start sites (TSS) compared with 
cell type-specific enhancers (Supplemental Fig. S3A). 

These defined sets of sequences with functionally charac- 
terized enhancer activity across three different cell types to- 
gether with a large set of experimentally tested negative con- 
trol sequences constitute an unprecedented resource to study 
the sequence features that underlie cell type-specific enhancer 
activity. 

Sequence motifs are differentially enriched and predictive 
for cell type-specific enhancer activity 

STARR-seq measures the enhancer activity of defined DNA frag- 
ments in the constant sequence environment of a reporter plasmid 
(Fig. 2A) and thus is independent of the fragments' genomic 
contexts and chromatin states (Arnold et al. 2013). We therefore 
hypothesized that all functional differences between the four 
classes of enhancers are determined by the underlying enhancer 
sequences via defined sequence features, which we should be able 
to discover by sequence analysis. 

To identify the sequence features responsible for enhancer 
activity of each of the four groups, we established a rigorous cross- 
validation protocol in which we used distinct subsets of en- 
hancers for motif discovery, motif enrichment analysis, and pre- 
dictor training and evaluation (Fig. 2B; Methods). This avoids 
circular reasoning and prevents overfitting, which have been 
prevalent problems during regulatory sequence analyses (e.g., 
Yuan et al. 2007). 
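
The core idea of this protocol, that no enhancer contributes to more than one analysis step, can be illustrated with a short Python sketch. The subset names and the fractions used in the usage note are assumptions chosen for illustration; the actual subset sizes are given in Figure 2B and the Methods, not here.

```python
import random


def split_disjoint(items, fractions, seed=0):
    """Shuffle items and cut them into non-overlapping named subsets.

    `fractions` maps a subset name to its share of the data; the shares
    must sum to 1. The last subset takes whatever remains, so every item
    is used exactly once.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    names = list(fractions)
    subsets, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)
        else:
            end = start + round(fractions[name] * len(shuffled))
        subsets[name] = shuffled[start:end]
        start = end
    return subsets
```

For example, `split_disjoint(enhancers, {"discovery": 0.2, "enrichment": 0.3, "prediction": 0.5})` would reserve separate enhancers for motif discovery, motif enrichment analysis, and predictor training and evaluation, so that a motif is never evaluated on the sequences it was learned from.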

We found known TF motifs, computationally identified motifs 
(Stark et al. 2007), and de novo discovered motifs enriched in 
enhancer sequences of each functional class compared with control 
regions that have the same genomic distribution but are 
inactive in all three cell types (Fig. 2C). Interestingly, the motif 
enrichments were not uniform but showed clear differences be- 
tween the four enhancer classes, which is emphasized when 
comparing the sequences of each class against sequences from the 
other three classes (Supplemental Fig. S4). 
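
A motif enrichment of this kind can be sketched as follows. The statistical test is an assumption: this section does not state which test was used, so the sketch uses a one-sided hypergeometric test on the number of sequences containing a motif, followed by a Benjamini-Hochberg correction. Only the significance criteria (FDR-adjusted P < 0.01 and fold enrichment ≥2, as given in the Figure 2 legend) come from the text; all function names are hypothetical.

```python
from math import comb


def fold_enrichment(enh_hits, n_enh, ctrl_hits, n_ctrl):
    """Fraction of enhancers with the motif over fraction of controls with it."""
    if ctrl_hits == 0:
        return float("inf") if enh_hits > 0 else 1.0
    return (enh_hits * n_ctrl) / (ctrl_hits * n_enh)


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n of N sequences are enhancers and K of N carry the motif."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(p_values):
    """FDR-adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_enriched(fold, adjusted_p, min_fold=2.0, max_fdr=0.01):
    return fold >= min_fold and adjusted_p < max_fdr
```

With 500 enhancers and 1000 controls, a motif found in 200 enhancers and 100 controls would have a 4-fold enrichment; its P-value is obtained from `hypergeom_upper_tail(200, 500, 300, 1500)` and should be adjusted together with the P-values of all other tested motifs.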

In fact, this differential motif distribution was sufficient to 
predict the functional classes for enhancers solely based on their 
sequences with high accuracy (Fig. 2D): We were able to correctly 
classify between 74.5% and 81.0% of all enhancers against negative 
controls (AUCs 0.80-0.90) and between 63.9% and 71.6% 
against the union of positive enhancers from the respective other 
classes (AUCs 0.67-0.79) using a support vector machine (SVM) 
and leave-one-out cross-validation (LOOCV) (Supplemental Fig. 
S5; Yáñez-Cuna et al. 2012; see Supplemental Table S1 for prediction 
performance using other cross-validation schemes). Similarly, 
broadly active enhancers could be discriminated from enhancers 
that are active in two out of the three cell types, 
suggesting that they constitute a distinct class of enhancers 
(Supplemental Table S2). 
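
The evaluation scheme, leave-one-out cross-validation scored by the area under the ROC curve, can be illustrated with a dependency-free Python sketch. Note the substitution: to keep the example self-contained, a simple nearest-centroid classifier stands in for the SVM used in the study, and the feature vectors are assumed to be per-sequence motif counts. All function names are hypothetical.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loocv_scores(features, labels):
    """Score each sequence with a model trained on all other sequences.

    Labels are 1 (enhancer class) or 0 (control). A positive score means
    the held-out sequence lies closer to the enhancer centroid.
    """
    scores = []
    for held_out, x in enumerate(features):
        train = [(f, y) for i, (f, y) in enumerate(zip(features, labels)) if i != held_out]
        pos = centroid([f for f, y in train if y == 1])
        neg = centroid([f for f, y in train if y == 0])
        scores.append(sq_dist(x, neg) - sq_dist(x, pos))
    return scores


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.0):
    """Fraction of sequences classified correctly at the given score threshold."""
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)
```

Because each sequence is scored by a model that never saw it, the resulting AUC and accuracy estimate performance on unseen enhancers. An AUC of 0.5 corresponds to random guessing, which is what the randomized-label controls in Figure 2D are expected to show.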

This shows that the difference in motif content is sufficient 
for the correct discrimination of enhancer activity patterns from 
enhancer sequences. Furthermore, these results also suggest that 
sufficiently many different enhancers in each class share the same 
characteristic motif content such that cis-regulatory motif signatures 
can be learned from some enhancers to predict novel and 
unseen enhancers in a cross-validation setting. Successful pre- 
dictions would not be possible if different enhancers were active 
due to entirely different motifs. 

Figure 1. STARR-seq identifies enhancers with different cell type activity profiles. (A) Venn diagram of the STARR-seq enhancers according to their 
activity in three different Drosophila cell types: S2 (blue), OSC (red), and BG3 (yellow). (B) UCSC Genome Browser screenshot for examples of S2-specific, 
OSC-specific, BG3-specific, and broadly active enhancers. (C) The expression levels of genes neighboring enhancers from each of the four enhancer classes 
(quantile-normalized RPKM [reads per kilobase exon model] values; Wilcoxon P-values) in S2 cells, OSCs, and BG3 cells. Black shows a negative control 
with genes neighboring randomly chosen regions that were inactive in all three cell types, according to STARR-seq. 



Cell type-specific motif features 

Among the most differentially enriched and discriminative 
motifs for the three cell type-specific enhancer classes, we found 
motifs of known TFs. This included, for example, GATA and E-box 
motifs for S2-specific enhancers, Chorion factor 2- (Cf2) and 
Pointed- (Pnt) like motifs for BG3-specific enhancers, and 
Forkhead- (Fkh) and Traffic jam- (Tj) like motifs for OSC-specific 
enhancers (Fig. 2C; Supplemental Fig. S4). The GATA motif is 
recognized by the TF Serpent (Srp), which is known to be required 
in S2 cells (Rehorn et al. 1996; Rämet et al. 2002), and the 
E-box motif can be recognized by Twist, a master regulator of 
early mesoderm development expressed in S2 cells (Arnold et al. 
2013). Similarly, Tj (a Maf TF) is necessary for the development 
of OSCs (Li et al. 2003), and is a well-known OSC-specific marker 
gene (Saito et al. 2009). 

To test if GATA and E-box motifs are important for enhancer 
activity in S2 cells, we performed luciferase assays for three S2 
cell-specific enhancers (Fig. 3A). All three wild-type enhancers 
were active in S2 cells, but when we mutated either the GATA 
motifs or the E-box motifs, the activity of all enhancers dropped 
substantially. This suggests that both types of motifs are indeed 
functionally important for S2 cell-specific enhancers. In contrast, 
individually mutating three different additional motifs that are all 
predicted to be nonfunctional did not alter the enhancer activity 
(Supplemental Fig. S6). 

Similarly, mutating the Fkh-like motifs in three OSC-specific 
enhancers substantially decreased or abolished the enhancer ac- 
tivity for all three enhancers. This suggests that the computa- 
tionally predicted and strongly differentially enriched Fkh-like 
motif is required for enhancer function in OSCs (Fig. 3B). 

The results show that enhancers with shared cell type-specific 
functions share sequence motifs that can be identified computa- 
tionally, are predictive, and are required for function. 

Figure 2. Enhancer classes display differential motif content that is predictive. (A) Cartoon of STARR-seq highlighting that a genome-wide library of 
candidate fragments is tested for enhancer activity in a constant sequence environment (for details, see Arnold et al. 2013). (B) Definition of nonoverlapping 
enhancer subsets for subsequent parts of the analysis (schematic). (LOOCV) Leave-one-out cross-validation (see Methods for details). (C) 
Heatmap showing motif enrichments in four enhancer classes compared with negative control regions (neg). Shown are only motifs with significant 
enrichments in at least one of the four enhancer classes (FDR adjusted P-value <0.01 and fold enrichment ≥2); matrix cells with nonsignificant enrichment 
values (FDR adjusted P-value ≥0.01) are shown in white. (D) Receiver operating characteristic (ROC) plot for binary enhancer classification of all four 
enhancer classes versus the negative (dark colors) and positive (light colors) control sets using LOOCV. ([AUC] Area under the ROC curve.) Controls (gray) 
were performed by randomizing the sequences' assignments to the enhancer or control groups (see Yáñez-Cuna et al. 2012). 

Dinucleotide repeat motifs are required for broadly active 
enhancers 

In addition to the cell type-specific TF motifs discussed above, we 
found a set of motifs enriched in all four enhancer classes and required 
for successful enhancer predictions. This set included some 
motifs of broadly expressed activators (e.g., AP-1 and STAT) and 
also motifs that consisted of the repeated dinucleotides CA, GA, or 
CG (i.e., all possible dinucleotides except TA) (Fig. 2C; Supplemental 
Figs. S7, S8), which we term dinucleotide repeat motifs 
(DRMs). Interestingly, while the activator motifs and DRMs were 
substantially enriched in all three cell type-specific enhancer 
classes compared with negative controls, they were even more 
highly enriched in broadly active enhancers (Fig. 4A; Supplemental 
Figs. S7, S8A,B). In fact, they were the most important 
features to computationally discriminate broadly active from cell 
type-specific enhancers (Supplemental Fig. S5). The increased enrichment 
of DRMs in broadly active enhancers was due to both an 
increased number of nonoverlapping DRMs and longer stretches 
of clustered and overlapping DRMs (Supplemental Fig. S8B,C). 
CA, GA, and CG (but not AT) type DRMs showed increased evolutionary 
conservation in all enhancer classes compared with 
negative regions (Supplemental Fig. S9), and their increased 
abundance and lengths, particularly in broadly active enhancers, 
were also observed in orthologous genomic regions in D. yakuba 
and D. pseudoobscura (Supplemental Fig. S10). This suggests that 
DRMs might constitute a novel class of general enhancer features 
(Fig. 2C) and that a high number of DRMs might allow broad 
enhancer activity across several cell types. 

Figure 3. Predicted transcription factor motifs are important for cell type-specific enhancer activity. 
(A) Luciferase (Luc) assays in S2 cells of wild-type (light blue) and GATA or E-box motif mutant (dark 
blue) sequences of three S2-specific enhancers. (B) Luciferase assays in OSCs of wild-type (light red) and 
Fkh motif mutant (dark red) sequences of three OSC-specific enhancers. Neg denotes a negative control 
sequence used for normalization (Arnold et al. 2013), and error bars indicate standard deviations of at 
least three independent biological replicates. Shown are P-values from unpaired t-tests. 

To validate the functional importance of DRMs, we selected 
three broadly active enhancers and assessed their activities by luciferase 
assays in both S2 cells and OSCs (Fig. 4B,C). While all three 
enhancers were active in both cell types, the activity of all mutant 
variants in which the GA DRMs were mutated was strongly reduced 
(Fig. 4B). Similarly, mutating the CA DRMs in the two enhancers that 
contained such motifs disrupted the activity of two out of two enhancers 
in OSCs and of one out of two enhancers in S2 cells (Fig. 4D). 
For the second one, the activity was enhanced in S2 cells, presumably 
because the mutations created another functional motif or because 
the CA DRM can exert both activating and repressing functions, as is 
also known for other motifs and TFs (Bauer et al. 2010). 

Our results show that the GA and CA dinucleotide repeats are 
indeed required for the activity of broad enhancers in S2 cells and 
OSCs, as predicted by the computational analyses. Similarly, the 
AP-1 motif was also required for broad enhancer activity as predicted 
(Supplemental Fig. S11). 

DRMs are also required for cell type-specific enhancer activity 

Based on the observation that DRMs were also enriched in cell 
type-specific enhancers, we speculated that they might constitute 
more generally important enhancer features. We therefore mutated 
CA and CG DRMs in an S2-specific enhancer that contained 
such motifs and found that the mutation of either of the two DRMs 
indeed strongly impaired the enhancer activity (Fig. 4D). 

TF motifs modulate the activity of broadly active enhancers 
in a cell type-specific manner 

The results so far suggest that DRMs are needed for enhancer activity 
and that they might be sufficient if present in high numbers, 
as suggested by the cell type-independent activity of broadly active 
enhancers. Despite the enrichment of DRMs in broadly active 
enhancers compared with cell type-specific enhancers and the 
prediction that this might explain their broad activity (Fig. 4A; 
Supplemental Fig. S8), we wondered about the role of cell 
type-specific TF motifs (e.g., GATA) in broad enhancers. 
We therefore mutated the GATA motifs in one of the broadly active 
enhancers and found a 1.7-fold decrease of enhancer activity 
specifically in S2 cells, in which these motifs appear to be 
important (see above). As expected, we did not observe a drop in 
activity in OSCs, which appear to have different motif requirements 
(Fig. 4E). Notably, however, the remaining activity in S2 cells is still 
1.6-fold (P < 0.0017) above the activity when the GA DRMs are 
mutated. This suggests that the activity of broad enhancers can be 
modulated by cell type-specific TFs via their sequence motifs. 

A set of motifs that is sufficient 
for enhancer activity 

Our results suggest a general model for the sequence requirements 
of transcriptional enhancers, in which DRMs are required 
for enhancer activity and can be sufficient if present in high 
numbers, as is the case for broadly active enhancers. To test this 
model and the putative sufficiency of a defined set of motifs for 
enhancer activity directly, we copied the GA, CA, and AP-1 motifs 
from one of the broadly active enhancers (BA-2) into an inactive 
sequence. The motifs indeed conferred enhancer activity to the 
neutral sequence, and the resulting synthetic enhancer was active 
in both S2 cells and OSCs. Notably, in OSCs the synthetic enhancer 
was as strong as the original enhancer (BA-2), while in S2 
cells BA-2 was stronger, presumably because it also contains GATA 
motifs not included in the synthetic enhancer (Fig. 4F). Enhancer 
activity was also observed for two out of three additional synthetic 
enhancers derived from other broadly active enhancers (Supplemental 
Fig. S12). Importantly, this included one derived from BA-3, 
from which we exclusively copied DRMs but no additional 
motif. This suggests that the motifs we discovered indeed carried 
the necessary features for enhancer function, with full enhancer 
activity in S2 cells likely depending on additional cell type-specific 
modulatory sequences. 


Discussion 

Here, we made use of large sets of cell type-specific enhancers for 
three different Drosophila cell types identified by STARR-seq, a ge- 
nome-wide enhancer activity assay (Arnold et al. 2013). Impor- 
tantly, the genome-wide enhancer activity maps obtained by 
STARR-seq also allowed us to define experimentally validated in- 
active control regions, which are often not available. Computa- 
tional sequence analyses revealed that cell type-specific and 
broadly active enhancers showed strong differential enrichment of 
TF motifs and DRMs. These cis-regulatory motif signatures were 
predictive for the different functional enhancer classes in strictly 
cross-validated settings. This indicates that enhancers with com- 
mon function share characteristic sequence motifs and we could 
indeed validate these motifs' functional importance by motif dis- 
rupting mutations. 

Our results emphasize an important property of transcriptional 
enhancers: Several motifs are required for enhancer function, 
but none of these motifs is sufficient on its own. This strict 
motif cooperativity allows a limited number of motifs (and the 
corresponding regulators) to establish the many different regulatory 
programs and cell types we find in complex organisms. Based 
on our results, we suggest a general model for enhancer sequence 
requirements (Fig. 5A) in which two different types of sequence 
elements are important: (1) motifs for cell type-specific TFs (GATA, 
E-box, or Fkh motifs) or broadly expressed activators (AP-1 or 
STAT), and (2) CA, GA, and CG DRMs. 

Figure 4. Dinucleotide repeat motifs are enhancer features required for activity. (A) Distribution of dinucleotide repeat motif (DRM) occurrences for the 
GA DRM Trl/ME137 (left) and the CA DRM Motif 5 (right) in negative regions (gray), cell type-specific enhancers (blue, red, yellow), and broadly active 
enhancers (purple). The whiskers denote the 10th and 90th percentiles and Wilcoxon P-values are shown. (B,C) Luciferase (Luc) assays in S2 cells (blue) 
and OSCs (red) for three broadly active enhancers and DRM mutant variants. Shown are wild-type (light colors) and mutant (dark colors) sequences, in 
which GA DRMs (B) or CA DRMs (C) are disrupted. (D) Luciferase assays for disrupting the CA and CG DRMs in the cell type-specific enhancer S2-1. (E) 
Luciferase assays for disrupting the GATA motif in two broadly active enhancers. (B-E) Neg denotes a negative control sequence used for normalization 
(Arnold et al. 2013), and error bars indicate standard deviations of at least three independent biological replicates. Shown are P-values from unpaired 
t-tests. (F) Luciferase assays in S2 cells (blue) and OSCs (red) of a synthetic enhancer (syn) for which the GA and CA DRMs and the AP-1 motif were copied 
from a broad enhancer (BA-2) into an inactive genomic region (B.bone, backbone), while preserving their orientation and spacing. The activities of BA-2 
and B.bone are shown as controls. 

DRMs are novel enhancer features that are enriched and 
required in cell type-specific enhancers, in which they are 
complemented by additional motifs for cell type-specific TFs. A 
high number of DRMs is the main feature of broadly active 
enhancers, suggesting that they might be sufficient for broad 
enhancer activity across several cell types. DRMs are enriched in 
many independent (nonhomologous) enhancer sequences and 
we validated their functional requirement for four different 
representative sequences (Fig. 4B-D). We therefore conclude 
that our findings likely apply to enhancers more generally and 
that DRMs constitute general enhancer features. Our results are 
specific to the motifs described here, as the mutation of 
three different control motifs that were not important according 
to our analysis did not impair enhancer function (Supplemental 
Fig. S6). In addition, unlike the CA, GA, and CG DRMs, 
the fourth possible dinucleotide repeat sequence TA is depleted 
from enhancer sequences (Supplemental Fig. S8) and does not 
show increased evolutionary sequence conservation (Supple- 
mental Fig. S9). 
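
To make the notion of DRM number and DRM length concrete, the following Python sketch counts nonoverlapping dinucleotide repeat stretches in a sequence and reports the longest one. It is a simplification under stated assumptions: the DRMs in this study are motifs scored against sequences, whereas the sketch uses exact repeats found with a regular expression, and the minimum of three repeat units is an arbitrary choice. Both strands are covered by also searching for the reverse complement (e.g., TG for CA).

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def drm_pattern(dinucleotide, min_repeats=3):
    """Regex matching runs of the dinucleotide or of its reverse complement."""
    forward = dinucleotide.upper()
    reverse = forward.translate(_COMPLEMENT)[::-1]
    units = sorted({forward, reverse})  # CG and TA are their own reverse complement
    return re.compile("|".join(f"(?:{u}){{{min_repeats},}}" for u in units))


def drm_stats(sequence, dinucleotide, min_repeats=3):
    """Number of nonoverlapping repeat stretches and length (bp) of the longest."""
    pattern = drm_pattern(dinucleotide, min_repeats)
    stretches = [m.group(0) for m in pattern.finditer(sequence.upper())]
    return {
        "count": len(stretches),
        "longest": max((len(s) for s in stretches), default=0),
    }
```

For the sequence `TTGAGAGAGATTTCTCTCAA`, `drm_stats(seq, "GA")` finds two stretches (GAGAGAGA and its reverse-complement form TCTCTC), the longest spanning 8 bp. Comparing such counts between enhancers and inactive control regions is one simple way to see the kind of difference in number and length described above.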

The DRMs described here might be bound by broadly ex- 
pressed TFs such as Trithorax-like (Trl), which is ubiquitously 
expressed and recognizes GAGA motifs. Trl is known to interact 
with nucleosome remodeling factors to restructure the chromatin 
(Tsukiyama et al. 1994; Xiao et al. 2001) and has been associated 
with accessible regions characteristic of active enhancers (Farkas 
et al. 1994). Alternatively, DRMs might function by directly 
influencing the DNA structure (e.g., major and minor groove 
shapes or bendability/flexibility) (Htun and Dahlberg 1989), 
triple-stranded DNA formation (Espinas et al. 1996), or nucleo- 
some occupancy (Struhl and Segal 2013), and thus chromatin 
properties more generally. The CA DRM, for example, for which 
no TF is known, was found to be highly conserved between different 
Drosophila species (Elemento and Tavazoie 2005; Stark et al. 
2007), and the CA, GA, and CG DRMs are enriched in TF-binding 
sites and highly occupied target (HOT) regions, which display high 
TF-binding complexity, are associated with decreased nucleo- 
some density, and function as enhancers (Li et al. 2008; The 
modENCODE Consortium et al. 2010; Kvon et al. 2012). In- 
terestingly, GA and CA dinucleotide repeats were also enriched in 
HOT regions in C. elegans (Kvon et al. 2012), even though C. elegans 
does not have a known Trl homolog (Tsukiyama et al. 1994), sug- 
gesting that the motifs could have a generally important role in 
a wide range of enhancers across species, potentially independent of 
sequence-specific TFs. In addition, mouse retina enhancers have 
been reported to be GC-rich (White et al. 2012), consistent with the 
GC content of the DRMs described here. This suggests that DRMs are 
an important feature of enhancers more generally and across dif- 
ferent species, including mammals. 

Indeed, we found all three types of DRMs (but not TA dinucleotides) 
enriched in human regulatory regions as defined by 
DNase I hypersensitivity (Thurman et al. 2012) or by H3K4me1 
and H3K27ac histone marks (Fig. 5B; The ENCODE Project Consortium 
2012). Furthermore, using luciferase assays in HeLa cells, 
we confirmed the importance of the DRMs for enhancer function 
for two out of three different enhancers (Fig. 5C). This suggests that 
DRMs might be important for enhancer activity across a wide 
range of species, including flies and humans. 

Figure 5. Model of cell type-specific and broadly active enhancers. (A) Model for motif requirements of enhancer sequences. DRMs and motifs for cell 
type-specific TFs such as GATA, E-box, or Fkh motifs are required for cell type-specific enhancer activity (middle). Broadly active enhancers contain a higher 
number of DRMs, as depicted by DRMs^n (bottom). These differences in motif content are sufficient to discriminate between enhancer classes and between 
enhancers and negative regions (top). (B) Enrichment and depletion of DRMs in human regulatory regions defined by DNase I hypersensitive sites (DHS) 
(left) (Thurman et al. 2012) and by H3K4me1 and H3K27ac (right) (The ENCODE Project Consortium 2012). (C) Luciferase assays in HeLa cells of three 
human enhancers (light green) and variants in which all instances of one type of DRM are mutated (dark green). Error bars indicate standard deviations of 
three independent biological replicates. Shown are P-values from unpaired t-tests. 

Our demonstration that DRMs are generally important for 
enhancer activity has far-reaching implications for our under- 
standing of genome sequences and their functions and of gene 
regulation during development and disease. It argues that func- 
tionally important genomic elements might be missed when di- 
nucleotide repeats are masked during genome sequence analyses 
and that sequence elements involving DRMs might have been 
disregarded. Dinucleotide repeat-like motifs have indeed been 
detected before, e.g., enriched in regions bound by various regu- 
latory proteins (Li et al. 2008; The ENCODE Project Consortium 
2012; Wang et al. 2012) but appeared to have received little at- 
tention, e.g., during genome annotation or gene regulatory stud- 
ies. This is despite previous findings from mouse and human that 
changes in OA and CA repeat lengths in promoters were associated 
with differences in gene expression (Hamada et al. 1984; Wang 
et al. 2005; Morris et al. 2010), and even though such repeats have 
been associated with several diseases including lupus (Morris et al. 
2010) and cancer (Wang et al. 2005). How to treat dinucleotide 
repeats during sequence analyses and genome annotation and the 
functional roles of DRMs during gene regulation and potential 
mechanisms by which they contribute to enhancer activity are 
important questions for future studies. 

Methods 

STARR-seq experiment and data analysis 

We obtained Drosophila neuronal ML-DmBG3-c2 (BG3) cells from 
the DGRC and performed STARR-seq and deep sequencing as de- 
scribed previously (Arnold et al. 2013) with the following excep- 
tions: BG3 cells were cultured in M3 BPYE medium supplemented 
with 10% FCS, 10 ng/mL insulin, 1% P/S at 25°C. Transfection of 
plasmid libraries (1 µg DNA/1 × 10* cells) was performed with 1 × 
10' cells at 70%-80% confluency using Gene Pulser MXcell Elec- 
troporation System (24-well plate; Bio-Rad; cat. no. 165-2682). 1 × 
10^ cells in 800 µL K-PBS (inverted PBS) were subjected to each well 
(corresponding to a standard 0.4 mm electroporation cuvette), 
containing 10 µg of plasmid library in 100 µL EB. After 15 min 
incubation, BG3 cells were pulsed (500V-250nF-1000Ω). BG3 cells 
were spun down after electroporation and 6 × 10^ cells were 
resuspended in 10 mL growth medium. 

Read mapping and peak calling for the STARR-seq screens in 
BG3 cells were performed as described previously (Arnold et al. 
2013). We then defined cell type-specific enhancers as 401-bp re- 
gions centered on STARR-seq peak summits that showed a ≥2-fold 
enrichment over input (P-value < 0.001) in the cell type of interest 
and a <1.41-fold enrichment (P-value > 0.1) in the other two cell 
types. We defined broadly active enhancers based on S2 cell STARR- 
seq peaks, if peaks with distances of <500 bp (summit-to-summit 
distances) were independently called also in the other two cell 
types with ≥2-fold enrichment over input (P-value ≤ 0.001). We 
excluded enhancers that were <201 bp from an annotated trans- 
poson (Jurka 2000). 
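
The thresholds above can be read as a simple classification rule. The following Python sketch is illustrative only: it assumes that peaks have already been matched across cell types (the summit-to-summit matching within 500 bp and the transposon filter are not shown), and the `Signal` structure and function names are hypothetical rather than taken from the original pipeline.

```python
from dataclasses import dataclass

CELL_TYPES = ("S2", "BG3", "OSC")


@dataclass
class Signal:
    fold: float      # STARR-seq enrichment over input
    p_value: float   # P-value of that enrichment


def is_specific(signals, cell_type):
    """Cell type-specific: >=2-fold (P < 0.001) in the cell type of interest
    and <1.41-fold (P > 0.1) in the other two cell types."""
    own = signals[cell_type]
    if not (own.fold >= 2.0 and own.p_value < 0.001):
        return False
    others = [signals[c] for c in CELL_TYPES if c != cell_type]
    return all(s.fold < 1.41 and s.p_value > 0.1 for s in others)


def is_broad(signals):
    """Broadly active: >=2-fold (P <= 0.001) in every cell type.
    Applying the same cutoff to the S2 peak itself is an assumption."""
    return all(s.fold >= 2.0 and s.p_value <= 0.001 for s in signals.values())


def classify(signals):
    """Return the enhancer class label for one matched region."""
    for cell_type in CELL_TYPES:
        if is_specific(signals, cell_type):
            return f"{cell_type}-specific"
    return "broad" if is_broad(signals) else "unclassified"
```

Regions that fall between the two sets of thresholds (for example, 1.6-fold enrichment in a second cell type) are left unclassified in this sketch.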

Peak-to-gene assignment and GO analysis 

We assigned genes to enhancers, if the enhancers were located 
within 4 kb from the TSS of the genes. We then calculated the 
enrichment of Gene Ontology (GO) categories (Ashburner et al. 
2000) in each class of enhancers against the union of the other 
three classes. We used the false discovery rate (FDR) correction for 
multiple testing (R) to adjust P-values, discarded functional cate- 
gories with a corrected P-value > 0.05, and selected the top cate- 
gories after sorting by their fold enrichment. 
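
As an illustration of the filtering step, the sketch below adjusts P-values and ranks the surviving categories by fold enrichment. It assumes the Benjamini-Hochberg procedure (the usual meaning of "FDR" correction in R); the text does not state which enrichment test produced the raw P-values, so these are taken as given. The function names and the `n_top` default are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def top_categories(results, alpha=0.05, n_top=10):
    """results: list of (category, fold_enrichment, raw_p_value).
    Discard categories with corrected P > alpha, then sort by fold enrichment."""
    adjusted = benjamini_hochberg([r[2] for r in results])
    kept = [(cat, fold, q)
            for (cat, fold, _), q in zip(results, adjusted)
            if q <= alpha]
    return sorted(kept, key=lambda r: r[1], reverse=True)[:n_top]
```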

Definition of enhancer classes 

We defined stringent classes of cell type-specific and broadly ac- 
tive enhancers by taking the strongest 500 enhancers of each type 
(see above). We created a set of 1000 negative control regions by 
randomly selecting genomic regions without STARR-seq enrich- 
ment from the genome, preserving the genomic distribution of 
the functional STARR-seq enhancers according to coding se- 
quence (CDS), introns, etc. In addition, to determine sequence 
features that discriminate between the different enhancer classes 
rather than between enhancers and negative sequences, we cre- 
ated an additional control set (positive control). For each en- 
hancer class (i.e., S2-, BG3-, OSC-specific, and broadly active) we 
defined this positive control set as the union of all enhancers 
from the other three classes. For the motif analysis (see above), we 
separated the 500 enhancers of each of the four classes randomly 
into five nonoverlapping subsets, of which we used one subset for 
motif de novo discovery, two subsets for feature selection, 
and two subsets for SVM training and evaluation using LOOCV 
(see Fig. 2B). 
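
The partitioning of each class into nonoverlapping subsets could be sketched as follows. This is a minimal illustration, assuming a simple seeded shuffle; the function name, dictionary keys, and the seed are hypothetical.

```python
import random


def split_enhancers(enhancer_ids, seed=0):
    """Randomly split one enhancer class into five nonoverlapping fifths:
    one for de novo motif discovery, two for feature selection,
    and two for SVM training and evaluation."""
    ids = list(enhancer_ids)
    random.Random(seed).shuffle(ids)   # seeded for reproducibility
    fifths = [ids[i::5] for i in range(5)]
    return {
        "motif_discovery": fifths[0],
        "feature_selection": fifths[1] + fifths[2],
        "svm_evaluation": fifths[3] + fifths[4],
    }
```

With 500 enhancers per class, this yields subsets of 100, 200, and 200 regions, and no region is shared between motif discovery, feature selection, and evaluation.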

To define human regulatory regions, we first defined as 
broadly active human enhancers those intergenic regions that 
were nucleosome depleted in at least 120 cells or tissues by using 
DNase I hypersensitivity data (Thurman et al. 2012). Additionally, 
we used ChIP-seq data for the histone marks H3K4me1 and 
H3K27ac (The ENCODE Project Consortium 2012) and defined as 
broadly active enhancers regions that had both histone marks in 
the following cell lines: GM12878, hESC, HMEC, HSMMt, HUVEC, 
K562, NHA, NHLF, NHEK, and Osteoblasts. In both cases we re- 
moved regions closer than 10 kb from a TSS and excluded regions 
that overlapped with annotated CpG islands (UCSC Genome 
Browser) to avoid any bias toward CG repeats. We obtained a total 
of 4285 and 789 regulatory regions using DHS and ChIP-seq data, 
respectively. As a control, we selected an equal number of random 
intergenic regions. 

Analysis of gene expression 

For the gene expression analysis, we assigned enhancers to all 
genes with TSSs within 4 kb of the STARR-seq peak summit, 
allowing for zero, one, or several genes to be assigned to one en- 
hancer. We then considered only genes that were assigned exclu- 
sively to enhancers of a single enhancer class. We quantile-normal- 
ized available RNA-seq data (The modENCODE Consortium et al. 
2010; Arnold et al. 2013) in R and visualized the data with a box plot 
that shows the median, the 25th, and 75th percentiles (boxes) and 
the 10th and 90th percentiles (whiskers). 
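
The assignment and filtering logic can be sketched in Python as below. The data layout (tuples of chromosome, position, and class) and the function name are hypothetical, and treating "within 4 kb" as an inclusive distance of 4000 bp is an assumption. The nested loop is quadratic and is meant only to make the rule explicit.

```python
from collections import defaultdict

MAX_DISTANCE = 4000  # bp between a TSS and a STARR-seq peak summit


def exclusive_gene_classes(enhancers, genes, max_distance=MAX_DISTANCE):
    """enhancers: list of (chrom, summit, enhancer_class)
    genes: list of (gene_id, chrom, tss)
    Returns {gene_id: enhancer_class} for genes whose assigned enhancers
    all belong to a single enhancer class."""
    classes_by_gene = defaultdict(set)
    for chrom, summit, enh_class in enhancers:
        for gene_id, gene_chrom, tss in genes:
            if gene_chrom == chrom and abs(tss - summit) <= max_distance:
                classes_by_gene[gene_id].add(enh_class)
    # Genes linked to enhancers of more than one class are dropped.
    return {gene: next(iter(classes))
            for gene, classes in classes_by_gene.items()
            if len(classes) == 1}
```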

Motif de novo discovery 

We performed de novo motif discovery for the comparisons of 
each enhancer class against the respective positive and negative 
control sets (S2-spec. vs. Pos, S2-spec. vs. Neg, BG3-spec. vs. Pos, 
BG3-spec. vs. Neg, OSC-spec. vs. Pos, OSC-spec. vs. Neg, Broad 
vs. Neg, and Broad vs. Pos). For this, we used the one-fifth of 
the class reserved for motif discovery (see above), and compared 
the sequences using DREME (Bailey 2011) and BioProspector 
(Liu et al. 2001). We then combined all resulting de novo motifs 
from all comparisons with known motifs (from Yáñez-Cuna 
et al. 2012). 

Motif matching and motif count-based enhancer predictions 

For the motif-based enhancer predictions, we first counted the oc- 
currences of all TF motifs in the enhancers of each class as before 
(Yáñez-Cuna et al. 2012) with a position weight matrix (PWM) cutoff 
P < 9.76 × 10⁻⁴ (1/4096). We performed feature selection by back- 
ward elimination as described by Yáñez-Cuna et al. (2012) using only 
motif counts from two-fifths of the regions. We then assessed the 
prediction accuracy by LOOCV, fivefold, 10-fold, 15-fold, and 20- 
fold cross-validation on the remaining two-fifths of the enhancers. 
By separating our set into nonoverlapping subsets for motif discov- 
ery, feature selection, and evaluation and by predicting previously 
unseen regions, we eliminate the risk of overfitting (Yuan et al. 
2007). The SVM was trained with libSVM (Chang and Lin 2011) using 
a linear kernel and default parameters. We used as features the motifs 
mentioned above with three different PWM cutoffs (3.9 × 10⁻³, 
1/256; 9.76 × 10⁻⁴, 1/1024; and 2.44 × 10⁻⁴, 1/4096) and as attri- 
butes the number of instances within each region for each particular 
PWM cutoff. For the analysis of enrichment and depletion of DRMs 
in human regulatory regions, we searched for DRMs 6 bp long and 
compared the number of DRM instances against negative regions. 
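
To make the DRM comparison concrete, the following Python sketch counts 6-bp dinucleotide repeats in a set of regions and compares them to a control set. It rests on several assumptions not stated in the text: overlapping matches are counted, only the forward strand is scanned, homopolymers are excluded, a pseudocount of 1 is added, and the two sets are of equal size and length so that raw counts are comparable. All function names are hypothetical.

```python
from itertools import product


def drm_patterns(length=6):
    """All dinucleotide repeats of the given length, e.g. 'GAGAGA'.
    Homopolymers such as 'AAAAAA' are excluded (an assumption)."""
    reps = length // 2
    return sorted({(a + b) * reps
                   for a, b in product("ACGT", repeat=2) if a != b})


def count_occurrences(seq, pattern):
    """Count overlapping occurrences of pattern in seq."""
    count = 0
    start = seq.find(pattern)
    while start != -1:
        count += 1
        start = seq.find(pattern, start + 1)
    return count


def drm_counts(sequences, length=6):
    """Total count of each DRM over a set of sequences (forward strand only)."""
    return {p: sum(count_occurrences(s.upper(), p) for s in sequences)
            for p in drm_patterns(length)}


def fold_enrichment(regions, negatives, pseudocount=1.0):
    """Ratio of DRM counts in regions versus negative regions."""
    pos, neg = drm_counts(regions), drm_counts(negatives)
    return {p: (pos[p] + pseudocount) / (neg[p] + pseudocount) for p in pos}
```

Values above 1 indicate enrichment and values below 1 indicate depletion relative to the negative regions; a significance test would be needed on top of this ratio.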

Visualization of differential motif enrichments 

To visualize the motif enrichments in each of the classes, we counted 
the occurrences of all TF motifs in the enhancer sequences of each 
class and the respective control sets using the same PWM cutoffs as 
above (P < 9.76 × 10⁻⁴ [1/4096]). Then, we determined the en- 
richment of each motif between each of the enhancer classes and 
their control sets and visualized significant enrichments using 
a heatmap representation. Because the visualization is independent 
of the motif-based enhancer predictions, we reused the four-fifths of 
the enhancers from each class not used for motif de novo discovery. 

Luciferase assay 

To test the functional importance of certain sequence motifs, we used 
luciferase assays to compare wild-type enhancer sequences with 
mutant versions in which we disrupted motifs that were predicted to 
be important. We selected wild-type enhancers that were confidently 
predicted by our approach (Yáñez-Cuna et al. 2012) and for which we 
could design primers that defined a <500-bp enhancer region. This 
allowed the PCR amplification of the wild-type sequences from ge- 
nomic DNA and the chemical synthesis of the mutant versions and of 
the synthetic enhancers as Integrated DNA Technologies (IDT) 
gBlocks. We chose three S2-specific enhancers, three OSC-specific 
enhancers, five broadly active enhancers, and three human en- 
hancers. All Drosophila sequences were cloned into pCR8-TOPO-GW 
and then shuttled to pGL3_GW_luc+ (Arnold et al. 2013) using 
Gateway LR Clonase II (Invitrogen) recombination and verified by 
Sanger sequencing. One hundred thousand Drosophila cells were co- 
transfected with the respective firefly constructs (100 ng) and Renilla 
control plasmid ubi-63E-RL (10 ng) using Fugene HD Transfection 
Reagent (Promega; for OSCs and BG3 cells) and JetPei (Polyplus 
transfection; for S2 cells). Human enhancers were cloned into modi- 
fied pGL4-Promotor plasmid (Gateway-cassette was added between 
KpnI and BglII sites and minimal promoter between BglII and HindIII). 
Human genomic DNA for PCR amplification of wild-type enhancers 
was purchased from Promega. Fifteen thousand HeLaS3 cells were 
assayed as described above, using X-treme HP DNA (Roche) as 
a transfection reagent and pGL4.75 (Promega) as a transfection con- 
trol. Enhancer activity was measured by luciferase assay using the Dual 
luciferase kit (Promega) according to the manufacturer's instructions. 

Data access 

All deep sequencing data from this study have been submitted to 
the NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) 
under series accession number GSE49809 and are also 
available at http://www.starklab.org. 